SRI’s KNIFE image analysis system can be used for tracking objects and
material classes from one image to another. Variations on this theme are
the initial acquisition of target instances from database signatures and the
subsequent acquisition of additional instances in an image once a few objects
have  been  labeled.  Classification-based  tracking  is  facilitated  by  improved  color 
and  texture-energy  transforms.  KNIFE’s  labeling  and  partitioning  methods 
can  be  used  with  complex  targets,  and  are  relatively  unaffected  by  occlusions 
and  changes  in  object  appearance  during  tracking. 

1  Introduction


The KNIFE digital-image analysis system is a software assistant for extracting, labeling,
and editing such regions of an image as houses and vegetation in aerial imagery.
This  report  describes  tools  for  tracking  scene  objects  automatically  through  a  sequence  of 
images.  The  examples  emphasize  tracking  of  roads  and  roadside  objects  through  images 
acquired  from  an  autonomous  land  vehicle. 

Systems  that  track  objects  can  exploit  their  shape,  brightness,  color,  texture,  range,  or 
trajectory.  I  have  found  the  generalized  Hough  transform  [Sloan  80;  Ballard  81;  Laws  82] 
to  be  an  effective  approach  to  2-D-shape  tracking,  especially  when  targets  may  be  partially 
occluded  by  other  objects.  Correlative  matching,  particularly  hierarchical  or  coarse-fine 
matching  [Moravec  80;  Hannah  84,  85],  is  another  useful  technique  for  tracking  the  shape 
and  distinctive  markings  of  an  object,  especially  if  combined  with  a  robust  method  for 
rejecting false matches [Bolles 81; Fischler 81].

In some applications, shape alone is not sufficiently informative. Road tracking is
one such application, since the road orientation and cross section can change significantly
between successive images. Shapes of roadside objects are also affected by perspective, occlusion, and
shadow  effects.  New  objects  or  vegetational  patches  entering  the  field  of  view  must  also 
be  identified;  this  requires  reasoning  about  material  classes  as  well  as  regional  shape. 

A  type  of  material  can  often  be  identified  by  its  brightness,  color,  or  texture,  although 
sometimes  high-level  reasoning  about  shape  and  context  is  necessary.  My  KNIFE  program 
uses  local  pixel  or  neighborhood  properties  to  extract  regions  that  can  be  passed  on  to  other 
analytical  systems  [Fua  85ab,  86,  87ab].  It  can  either  extract  homogeneous  regions  and 
then label them or label pixels and then extract connected regions. All such partitioning
operations  are  integrated  with  noise-filtering  techniques  that  prevent  textured  regions 
from  being  split  too  finely.  Segmentation  and  labeling  are  usually  done  autonomously, 
although  the  program  can  be  used  interactively  and  may  someday  be  controlled  directly 
by a sophisticated reasoning system [Wesley 83, 84, 86; Fischler 88; Strat 88].

2  Knowledge  Representation 

The heart of the KNIFE program is its region representation. Most recursive segmenters
build a tree to describe how large regions have been divided into smaller ones,
as  well  as  to  keep  track  of  hypothesized  subregions  during  partitioning.  The  problem 
that  arises  with  a  tree,  or  with  any  technique  based  on  strict  spatial  hierarchy,  is  that  it 
does  not  permit  arbitrary  merging  and  attachment.  The  merging  of  neighboring  regions 
from  different  parts  of  the  tree  will  produce  new  regions  of  uncertain  parentage.  It  also 
destroys  the  rectilinear  alignment  used  in  quadtree  representations,  forcing  abandonment 
of  the  homogeneous  tree  representation  if  further  partitioning  is  contemplated.  Trees 
pose  additional  difficulties  as  cognitive  representations  supporting  the  user  interface:  they 
sometimes  necessitate  unintuitive  inclusion  relationships  (such  as  grouping  a  mountain’s 
snowcapped  peak  with  sky  and  cloud  regions)  and,  after  merging,  do  not  show  how  regions 
were  extracted  or  how  they  differ  from  siblings  and  neighbors. 

KNIFE uses a directed acyclic region graph, permitting any region record to be attached
to any other in a parent/child relationship.^1 It also has a region map for determining
regional statistics and spatial relationships. The graph serves as a semantic net, recording
spectral or semantic relationships among regions (rather than the partitioning history).
Disconnected composite regions (i.e., regions containing nonadjacent subregions) are
permitted and are particularly useful for representing material types, occluded objects, and
collections of similar or related objects. A road obscured by telephone poles, for instance,
could be represented in two ways: as one set of subregions linked to a common “road”
composite, and as another set linked to a “telephone pole” composite. A region can belong
to more than one such set without there being an imposed spatial or semantic hierarchy.

^1 The only exceptions are the whole-image region, which has no parent, and a pseudoregion
representing the scene area just outside the image.
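
The multi-parent attachment described above can be illustrated with a small Python sketch. It is not KNIFE code: the class name `Region` and the methods `attach`, `ancestors`, and `leaves` are hypothetical, chosen only to show how a directed acyclic graph lets one subregion belong to several composites while still rejecting cycles.

```python
class Region:
    """A node in a directed acyclic region graph (hypothetical names)."""

    def __init__(self, name):
        self.name = name
        self.parents = []    # a region may belong to several composites
        self.children = []

    def ancestors(self):
        """Return the set of all regions reachable through parent links."""
        found = set()
        stack = list(self.parents)
        while stack:
            node = stack.pop()
            if node not in found:
                found.add(node)
                stack.extend(node.parents)
        return found

    def attach(self, child):
        """Attach child under this region, refusing links that would form a cycle."""
        if child is self or child in self.ancestors():
            raise ValueError("attachment would create a cycle")
        if child not in self.children:
            self.children.append(child)
            child.parents.append(self)

    def leaves(self):
        """Return the names of primitive (childless) regions under this one."""
        if not self.children:
            return {self.name}
        result = set()
        for child in self.children:
            result |= child.leaves()
        return result


# Two disconnected road segments share one composite; a pole has two parents.
image = Region("image")
road = Region("road")
poles = Region("telephone pole")
foreground = Region("foreground objects")
seg_a, seg_b, pole_1 = Region("road-a"), Region("road-b"), Region("pole-1")

image.attach(road)
image.attach(poles)
road.attach(seg_a)
road.attach(seg_b)
poles.attach(pole_1)
foreground.attach(pole_1)   # second parent: no strict hierarchy is imposed
```

A tree could not express the last attachment, since `pole-1` would need exactly one parent.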

Region  records  contain  both  simple  descriptors  (e.g.,  area)  and  pointers  to  histograms 
and  other  knowledge-base  objects.  Most  fields,  including  numeric  descriptors,  can  contain 
a  “null”  or  “unknown”  code  marking  information  that  is  to  be  computed  when  needed. 
Subroutines  that  access  these  fields  know  how  to  obtain  such  values  from  the  region 
map  or  graph,  and  will  attach  the  result  to  the  region  record  so  that  it  will  not  have 
to  be  recomputed  if  it  is  needed  again.  (They  also  null  descriptors  that  have  become 
incorrect  because  of  editing  operations  on  the  region  graph.)  This  “lazy  evaluation”  saves 
considerable computation and storage, although it does cause certain analysis-and-display
operations  to  consume  varying  amounts  of  time,  depending  on  what  has  been  stored. 
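
The lazy-evaluation scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the KNIFE implementation: the class `RegionRecord`, its two descriptors, and the `merge` operation are hypothetical, and a plain list of pixel values stands in for the region map.

```python
class RegionRecord:
    """Region record whose descriptors are computed on demand (hypothetical sketch)."""

    UNKNOWN = None  # stands in for the "null" or "unknown" code

    def __init__(self, pixels):
        self.pixels = list(pixels)   # brightness values, in place of a region map
        self._cache = {"area": self.UNKNOWN, "mean": self.UNKNOWN}
        self.computations = 0        # counts real evaluations, for illustration

    def _compute(self, field):
        self.computations += 1
        if field == "area":
            return len(self.pixels)
        if field == "mean":
            return sum(self.pixels) / len(self.pixels)
        raise KeyError(field)

    def get(self, field):
        """Return a descriptor, computing and storing it only if it is unknown."""
        if self._cache[field] is self.UNKNOWN:
            self._cache[field] = self._compute(field)
        return self._cache[field]

    def merge(self, other):
        """Edit the region, then null every descriptor that may now be wrong."""
        self.pixels.extend(other.pixels)
        for field in self._cache:
            self._cache[field] = self.UNKNOWN


a = RegionRecord([10, 20, 30])
b = RegionRecord([40])
first = a.get("mean")    # computed from the pixels
second = a.get("mean")   # served from the record, no recomputation
a.merge(b)               # editing invalidates the stored values
third = a.get("mean")    # recomputed on the next access
```

The cost of a `get` call therefore depends on what has already been stored, which is the timing variability noted above.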

From one standpoint, the entire KNIFE system is only a coordinated tool kit of subroutines
for manipulating region graphs and maps. Just as a mail-handling program allows
a sequence of messages to be read and manipulated, KNIFE offers the capability to display
and manipulate partitioned images. Its library of subroutines forms a language for
implementing  the  pseudo-English  commands  available  to  the  user.  A  great  deal  of  care 
has  gone  into  making  this  a  convenient  and  efficient  language,  although  much  remains  to 
be  done.  The  following  sections  describe  some  of  the  algorithms  that  make  KNIFE  an 
effective  tool  for  extracting,  identifying,  and  tracking  scene  objects. 

3  Preprocessing  Algorithms 

Local  properties  are  any  measurements  or  estimates  that  can  be  made  about  the  scene 
content  at  a  particular  pixel  location.  Intrinsic  scene  properties,  such  as  surface  albedo 
and  orientation,  are  generally  not  measurable,  so  empirical  properties  such  as  local  image 
brightness  and  gradient  must  be  used.  Useful  properties  are  those  that  are  characteristic 
of  particular  material  types  or  that  change  very  slowly  (spatially  and  temporally)  for  any 
particular  object.  Different  local  properties  may  be  useful  for  different  objects  or  parts 
thereof. 

Tracking  systems  may  use  radar,  range,  thermal  infrared,  or  other  imagery  with  special 
characteristics  that  dictate  particular  methods  of  analysis.  I  shall  assume  that  the  imagery 
to  be  analyzed  here  is  typical  monochrome  or  color/multispectral  imagery,  or  can  be 
treated  as  such.  My  goal  is  to  extract  as  much  useful  information  as  possible  without 
requiring task-specific or target-specific knowledge. Semantic target-recognition criteria,
such  as  shape,  context,  and  trajectory  dynamics,  could  be  added. 

Monochrome  imagery  may  be  adequate  for  simple  tracking  tasks  like  following  a  plane 
against  the  sky,  or  for  simple  object  extraction  tasks  like  identifying  suburban  houses  by 
their bright roofs. Changing imaging conditions across a scene or among scenes make it
difficult to distinguish more than a few materials, but distinctive classes, such as cement,
asphalt, and soil, can often be separated by brightness alone.

In  complex  scenes,  color  or  multispectral  data  may  be  useful  enough  to  justify  the  extra 
cost  of  collection,  transmission,  storage,  and  processing.  When  color  is  not  available, 
it  can  often  be  replaced  by  texture  bands  computed  from  monochrome  data — at  least 
for  identifying  pixels  far  from  region  borders.  The  KNIFE  program  currently  treats  all 
data bands alike and will perform an analysis with as few or as many data bands as are
available.  It  performs  most  efficiently,  however,  when  each  band  encodes  information  that 
is  not  readily  available  from  the  others. 


3.1  Color  Representation 

Digital  color  imagery  is  typically  gathered  in  red-green-blue  (RGB)  form,  with  256 
possible luminance levels per pixel in each band. Often the bands are uncalibrated and
individually stretched or scaled, making it difficult to identify material classes by using a
database of reference signatures. Fortunately, this has little effect on tracking of object
instances from one image to another (or to another portion of the same image). I shall
assume that the RGB cube is a vector space with origin at (0,0,0), and shall develop
methods of histogram-based analysis that are insensitive to consequent hue instabilities.

RGB  bands  of  natural  scenes  tend  to  be  highly  correlated  and  hence  redundant.  I 
have  found  intensity-hue-saturation  (IHS)  representations  more  useful  for  partitioning  and 
tracking  operations.  The  exact  transform  employed  makes  little  difference,  but  results  are 
best if computed saturation is near zero for all colors near the achromatic axis. (The
chromaticity saturation formula most commonly used in computer vision [Tenenbaum 74;
Ohlander 75, 78; Kender 76, 77; Price 76],

$$ S = S_{\max} \left( 1 - \frac{3 \min(R, G, B)}{R + G + B} \right), $$

assigns low saturation values to all colors near white and high, unstable values to most
colors near black. Saturation thus correlates negatively with brightness.) The National
Television System Committee (NTSC) YIQ system,

$$ \begin{aligned}
Y &= 0.509R + 1.000G + 0.194B \\
I &= 1.000R - 0.460G - 0.540B \\
Q &= 0.403R - 1.000G + 0.597B,
\end{aligned} $$

is awkward to represent digitally and less useful than IHS for segmentation [Laws 85],
perhaps because the Q band contains very little energy or information for natural scenes.
Ohta’s approximate Karhunen-Loeve color transform [Ohta 80],


$$ \begin{aligned}
I_1 &= R + G + B \\
I_2 &= R - B \\
I_3 &= 2G - (R + B),
\end{aligned} $$

is  easier  to  encode,  but  performs  much  like  the  YIQ  system. 
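
The instability of chromaticity saturation near black, noted earlier in this section, is easy to check numerically. The sketch below assumes the usual normalized form 1 - 3 min(R,G,B)/(R+G+B) with the scale factor set to 1; the function name and the convention of returning 0 for pure black are assumptions made for illustration.

```python
def chromaticity_saturation(r, g, b):
    """Return 1 - 3*min(R,G,B)/(R+G+B); black is assigned 0 by convention here."""
    total = r + g + b
    if total == 0:
        return 0.0
    return 1.0 - 3.0 * min(r, g, b) / total


# One-count changes near black swing the result across its whole range...
near_black = [chromaticity_saturation(*c) for c in [(1, 1, 1), (2, 1, 1), (1, 0, 0)]]

# ...while one-count changes near white barely move it.
near_white = [chromaticity_saturation(*c)
              for c in [(254, 254, 254), (255, 254, 254), (254, 253, 253)]]
```

Here `near_black` spans 0 to 1 for colors that differ by a single quantization step, whereas every value in `near_white` stays below 0.01.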

I use the following vividness-hue-saturation (VHS) system.^2 The bands are sufficiently
decoupled and informative, and my methods of analysis sufficiently matched to them, that
I have not had to retain additional redundant color bands as other researchers have done
[Ohlander 75, 78; Price 76].

^2 No relation to the VHS videotape format.

Vividness  (V)  is  a  measure  of  intensity  or  brightness,  defined  to  be  the  length  of  a 
color  vector  expressed  as  a  fraction  of  the  maximum  length  it  could  have  within  the  RGB 
color  cube.  (In  other  words,  we  divide  the  distance  from  the  origin  to  a  color  point  by 
the  distance  to  the  intersection  of  this  ray  with  a  cube  surface.  An  equivalent  formula  is 
given  below.)  Thus,  pure  spectral  colors — magenta,  red,  yellow,  green,  cyan,  blue,  and  all 
of  the  intermediate  colors  along  the  edges  of  the  color  cube — have  the  same  vividness  as 
white.  Maximally  bright  desaturated  colors,  such  as  shocking  pink,  are  also  fully  vivid. 

Shifted hue (H) is the customary zero-to-2π angular measure, but with the scale’s zero
rotated from pure red by π/3 to place the discontinuity at magenta. This permits the
circular nature of the hue scale to be ignored, since very few objects in natural imagery
are magenta.^3 My classification methods do not require Gaussian histogram peaks, so the
effect  of  splitting  a  histogram  peak  between  the  two  ends  of  the  color  scale  is  minimal. 
(Analysis  methods  that  are  suitable  for  circular  features  could  be  incorporated  if  needed.) 
A  special  code  is  used  to  mark  colors  on  the  achromatic  axis,  for  which  hue  is  meaningless. 

Cylindrical saturation (S) is the distance of a color point from the achromatic axis,
expressed  as  a  fraction  of  the  maximum  possible  distance  for  any  vivid  color.  The  nature  of 
the  RGB  cube  constrains  all  colors  near  the  black  or  white  corners  to  have  low  saturation. 
Primary  colors  and  their  complements  have  the  maximum  saturation,  while  other  pure 
spectral  colors  within  the  RGB  cube  are  slightly  below  the  maximum. 
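
The verbal definitions of vividness and cylindrical saturation can be turned into a short Python sketch. This follows the geometric descriptions above rather than any printed formula; the function names, the 8-bit `FULL_SCALE` constant, and the normalization of both quantities to the range 0 to 1 are assumptions.

```python
import math

FULL_SCALE = 255  # assumed 8-bit bands


def vividness(r, g, b):
    """Vector length as a fraction of the longest same-direction vector in the cube.

    Stretching (r, g, b) until it touches a cube face multiplies it by
    FULL_SCALE / max(r, g, b), so the ratio reduces to max(r, g, b) / FULL_SCALE.
    """
    return max(r, g, b) / FULL_SCALE


def cylindrical_saturation(r, g, b):
    """Distance from the achromatic axis, relative to the largest possible distance."""
    mean = (r + g + b) / 3.0
    # Distance from the point to its projection onto the line R = G = B.
    distance = math.sqrt((r - mean) ** 2 + (g - mean) ** 2 + (b - mean) ** 2)
    # The largest distance in the cube is reached by primaries and their complements.
    max_distance = FULL_SCALE * math.sqrt(2.0 / 3.0)
    return distance / max_distance


red = cylindrical_saturation(255, 0, 0)        # primary
yellow = cylindrical_saturation(255, 255, 0)   # complement of blue
orange = cylindrical_saturation(255, 128, 0)   # intermediate edge color
gray = cylindrical_saturation(128, 128, 128)   # on the achromatic axis
```

Under these assumptions `red` and `yellow` both reach the maximum, `orange` falls somewhat below it, and a bright pink such as (255, 128, 128) has the same vividness as white.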

Formulas for converting RGB sensor coordinates to VHS coordinates are as follows:

$$ V = \max(R, G, B) $$

[The expressions for S and H did not survive extraction. H is defined piecewise by
arctangent terms and the constant π/3, with the cases R > B, G > B; G > R; B > G; R > B;
and ACHROMATIC otherwise.]

where each condition on H is tested in turn. H is also taken modulo 2π, so that amount
is subtracted if H exceeds 2π. (This can occur only in the third line of the formula,
which computes hue for the bluish-red third of the chromaticity triangle.) V and S can be
normalized to be between 0 and 1, although the indicated formulas are more convenient if
integer results comparable to the original RGB coordinates are desired. I typically scale
H to lie in the range [0,179] with ACHROMATIC = 255.^4

^3 Exceptions may occur in images of distant mountains.

^4 A code of 0 for the achromatic axis might be preferable for many uses, but is misleading because the
achromatic colors should be considered equally close to or far from all other hues.
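
A shifted hue code with the properties described in this section can be sketched in Python. This is not the report's piecewise arctangent formula: it substitutes the standard `atan2` hue angle in the chromatic plane, rotates the zero by π/3 so that the wraparound falls at magenta, and scales to the codes 0 to 179 with 255 reserved for achromatic pixels. The function name is hypothetical.

```python
import math

ACHROMATIC = 255  # special code for colors on the achromatic axis


def shifted_hue_code(r, g, b):
    """Hue angle with its zero moved to magenta, scaled to the integers 0..179."""
    if r == g == b:
        return ACHROMATIC
    # Standard hue angle: 0 degrees at red, 120 at green, 240 at blue.
    angle = math.degrees(math.atan2(math.sqrt(3.0) * (g - b), 2 * r - g - b))
    shifted = (angle + 60.0) % 360.0     # rotate by pi/3: magenta sits at the wrap
    return round(shifted / 2.0) % 180    # two degrees per code


codes = {
    "red": shifted_hue_code(255, 0, 0),
    "yellow": shifted_hue_code(255, 255, 0),
    "green": shifted_hue_code(0, 255, 0),
    "blue": shifted_hue_code(0, 0, 255),
    "gray": shifted_hue_code(90, 90, 90),
}
```

With this choice red, yellow, green, and blue land at codes 30, 60, 90, and 150, well away from the discontinuity.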
This color space can be viewed as a cone of height V_max having black as the point at
the bottom, white as the center of the top end, and the pure spectral colors distributed
around the top circle at distance S_max from the achromatic axis. The cone may also be
envisioned as embedded in a cylinder of radius S_max, with black at the bottom center, or
(0,  ACHROMATIC,  0)  position,  but  colors  outside  the  central  cone  are  also  outside  the 
range  achievable  by  most  digital  (or  RGB)  sensors  and  displays. 

Whether terms such as “vividness” and “saturation” are psychophysically realistic
depends  partly  on  the  imaging  and  display  systems  employed.  I  would  not  expect  pure 
blue  and  green  to  be  as  “vivid”  as  white  and  yellow,  although  they  are  so  treated  by  my 
VHS  system.  Suffice  it  to  say  that  partitioning  and  labeling  performance  is  not  dependent 
on  precise  psychophysical  modeling. 

3.2  Texture  Representation 

Local texture properties are more difficult to characterize. Different textures exhibit
their distinctive variations at different spatial scales. The texture elements themselves
(e.g., gravel, blades of grass, plant stems, tree leaves, roof tiles) are often below the resolving
power of our imaging systems, forcing us to characterize texture patterns statistically.
One approach is to determine the typical element size or coarseness for any texture patch;
another is to determine characteristics that are constant across all scales of measurement
[Pentland  84].  My  own  efforts  have  concentrated  on  a  class  of  modified  “texture  energy” 
operators  [Laws  79,  80ab]  that  respond  to  the  density  and  contrast  of  different  spatial 
patterns  in  much  the  same  way  that  simple  edge  detectors  respond  to  image  gradients. 

Although  I  shall  explain  my  approach  in  some  detail,  there  are  many  image  textures 
for  which  brightness  and  local  variance  (especially  the  log  of  local  Gaussian-weighted 
variance  [Laws  88b],  as  discussed  below)  are  the  most  useful  descriptors.  It  is  rare  in 
natural  imagery  for  adjacent  texture  patches  to  have  the  same  brightness  and  variance  (or 
color  and  variance),  yet  differ  in  directionality  or  higher-order  properties.  Sophisticated 
statistical measures of texture are thus unnecessary for image partitioning, nor are they
often  of  much  additional  help  in  material  labeling. 

I  derive  a  texture  measure  in  two  steps.  (These  are  computed  during  a  single  pass 
through  the  image,  but  are  easier  to  understand  as  two  separate  passes.)  First,  I  convolve 
the image with a small center-weighted matrix or mask such as the 3x3 masks in Figure 1.
Next,  I  apply  a  Gaussian-weighted  local  variance  operator  to  the  filtered  image  and  report 
the  log  of  the  variance  as  a  local  texture  descriptor.  In  other  words,  I  compute  log  variance 
of  the  filtered  image  in  a  square  window,  giving  more  weight  to  pixels  near  the  window 
center. 
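
The two-step measure can be sketched in pure Python on a small image. The sketch rests on several assumptions that are not stated in the text: a binomial vector [1, 4, 6, 4, 1] is used as the center-weighted window, image edges are replicated, a floor of 1 is added before the logarithm to avoid log 0, and the gradient-like mask is just one example of a zero-sum 3x3 mask. All function names are hypothetical.

```python
import math


def filter2d(image, mask):
    """Same-size 2-D filtering with edge replication (mask applied without flipping)."""
    rows, cols = len(image), len(image[0])
    half_r, half_c = len(mask) // 2, len(mask[0]) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            acc = 0.0
            for j, mask_row in enumerate(mask):
                for i, weight in enumerate(mask_row):
                    yy = min(max(y + j - half_r, 0), rows - 1)
                    xx = min(max(x + i - half_c, 0), cols - 1)
                    acc += weight * image[yy][xx]
            out[y][x] = acc
    return out


def log_weighted_variance(image, weights_1d, floor=1.0):
    """Log of the local variance under a separable, center-weighted window."""
    total = float(sum(weights_1d)) ** 2
    window = [[a * b / total for b in weights_1d] for a in weights_1d]
    squares = [[v * v for v in row] for row in image]
    mean = filter2d(image, window)        # weighted local mean
    mean_sq = filter2d(squares, window)   # weighted local mean of squares
    return [[math.log(max(ms - m * m, 0.0) + floor) for m, ms in zip(mr, msr)]
            for mr, msr in zip(mean, mean_sq)]


def texture_energy(image, mask, weights_1d=(1, 4, 6, 4, 1)):
    """Step 1: filter with a small mask. Step 2: log of weighted local variance."""
    return log_weighted_variance(filter2d(image, mask), weights_1d)


EDGE_MASK = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]  # one zero-sum 3x3 mask

flat = [[50] * 8 for _ in range(8)]
stripes = [[50 + 40 * ((x // 2) % 2) for x in range(8)] for _ in range(8)]
flat_energy = texture_energy(flat, EDGE_MASK)
stripe_energy = texture_energy(stripes, EDGE_MASK)
```

A uniform patch yields zero everywhere, while the striped patch yields a large positive value at interior pixels.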

My original texture energy measures used unweighted windows, which caused blocky
artifacts in the output image around bright or dark points in the input. Standard
deviation within the window was approximated by a sum of absolute values (for zero-mean
filter outputs), but variance has better theoretical properties and is almost as easy
to compute. Logarithmic output helps match human textural perception and, moreover,
can be stretched to make good use of an 8-bit output scale. (Log variance also avoids
any square-root operation and differs from log standard deviation only by a factor of
two.) I have not conducted rigorous experiments, but the subjective improvement in
my texture energy measures seems worth the small computational cost of the Gaussian-weighted
variance operation.

Figure 1: Orthogonal 3x3 Texture Masks

A  vector  of  texture  energy  descriptors  can  be  generated  for  each  pixel  by  using  different 
texture  energy  masks  or  different  sizes  of  variance  window;  that  vector  can  then  be  used 
for  material  classification  or  image  partitioning.  Energy-gathering  windows  of  no  more 
than  two  or  three  times  the  width  of  the  filter  mask  are  recommended;  larger  sizes  give 
better  classification  accuracy  when  applied  to  large  texture  patches,  but  lack  the  resolution 
needed  for  analysis  of  typical  imagery. 

In some cases I have used 3x3 or even 1x1 variance windows (i.e., with no variance
computation), since my histogram-based classification and segmentation methods do
not depend on the inherent smoothing of larger windows [Laws 88ab]. (This raises interesting
questions about the nature of optimal or human texture perception. The “most
local” frequency or joint-probability texture measures are typically the most powerful in
natural imagery, a fact that was not appreciated when early comparative studies were
done [Dyer 76; Weszka 76]. I have yet to investigate the full power of gradient/curvature
representations and local histograms or information-theoretic statistics for texture
characterization.)

I tend to work with 3x3 pattern masks in 3x3 or 9x9 variance windows (depending
on the task), although most other workers have chosen the larger set of 5x5 masks in
15x15 or 31x31 unweighted windows that I originally proposed. (Note that my new
measures  are  even  more  local  than  these  numbers  would  imply,  since  Gaussian  weighting  of 
the  variance  window  reduces  the  effect  of  all  pixels  near  the  window  edges.)  An  alternative 
to  varying  the  mask  and  window  sizes  is  to  compute  identical  texture  energy  measures  at 
each  level  in  a  pyramid  of  image  resolutions  [Larkin  83].  Small  texture  operators  tend  to 
mimic  edge  detectors;  to  exploit  texture  in  natural  images,  however,  local  measures  must 
somehow  be  incorporated. 

I have developed a particularly simple class of basis vectors, the lattice aperture waveform
sets [Laws 79, 80ab], derived by repeatedly convolving the vectors [1, 1] and [-1, 1]
(or any of the resultant vectors) to obtain basis vectors of any desired length and sequency.^5
Horizontal and vertical pairs (possibly from sets of different order) can be convolved to
form separable rectangular masks, such as those in Figure 1, each giving rise to a different
texture energy measure. The vectors can also be convolved directly with an image to
achieve the same effect as convolving with the rectangular mask, usually with a significant
computational saving. Third-order and fifth-order basis-vector sets are

    [ 1  2  1]        [ 1  4  6  4  1]
    [-1  0  1]        [-1 -2  0  2  1]
    [-1  2 -1]        [-1  0  2  0 -1]
                      [ 1 -2  0  2 -1]
                      [ 1 -4  6 -4  1]

(I have negated, or equivalently reversed, the originally published form of the fourth
fifth-order vector to maintain consistent sinusoidal phase. Antisymmetric vectors should
have a negative term just before the central position.)
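
The repeated-convolution construction can be sketched in Python. The function names are hypothetical, and the raw convolution products are left with whatever sign they happen to have, so some vectors come out negated relative to the phase convention described above.

```python
def convolve(a, b):
    """Full discrete convolution of two integer sequences."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def sequency(v):
    """Number of sign changes between consecutive nonzero terms."""
    signs = [1 if t > 0 else -1 for t in v if t != 0]
    return sum(1 for s, t in zip(signs, signs[1:]) if s != t)


def waveform_set(length):
    """Vectors of the given length, one per sequency 0..length-1 (signs not normalized)."""
    vectors = []
    for k in range(length):
        v = [1]
        for _ in range(length - 1 - k):
            v = convolve(v, [1, 1])     # smoothing factor, adds no zero crossing
        for _ in range(k):
            v = convolve(v, [-1, 1])    # differencing factor, adds one zero crossing
        vectors.append(v)
    return vectors


def separable_mask(vertical, horizontal):
    """Rectangular mask formed from a vertical and a horizontal vector."""
    return [[a * b for b in horizontal] for a in vertical]


third = waveform_set(3)
fifth = waveform_set(5)
gradient_mask = separable_mask(third[0], third[1])
```

Convolving `third[1]` with `third[2]` gives a length-5 vector of sequency 3, the sum of the two original sequencies, and `gradient_mask` is the familiar horizontal Sobel gradient mask.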

Joint  classification  accuracies  of  the  original  texture  energy  measures  are  listed  in 
Tables  4-2  and  7-5  of  [Laws  80a].  The  3x3  measures  in  15  x  15  windows  were  about 
85%  accurate  in  a  difficult  eight-class  linear-discriminant  problem;  the  5x5  measures  were 
about  87%  accurate;  and  accuracy  of  the  7x7  measures  was  about  88%.  Performance 
remained  at  88%  when  all  83  of  these  measures  were  combined.  A  particular  set  of  32 
co-occurrence measures was only 71% accurate on the same problem. (I am unaware of
any  studies  using  quadratic  or  nearest-neighbor  classification.) 

I  seldom  use  more  than  one  or  two  of  the  eight  zero-mean  3x3  filter  masks.  While 
individual  texture-energy  measures  are  less  powerful  than  full  sets  (as  detailed  in  my 
dissertation),  the  following  quotation  [Pietikainen  82,  83]  is  reassuring: 

The  best  [3x3  and  5x5]  Laws  features  correctly  classify  25  out  of  28  samples, 
or  nearly  90%,  a  remarkable  result  for  a  one-feature,  seven-class  task. 

Corresponding  statistics  for  co-occurrence  and  “edge  per  unit  area”  measures  were  71% 
and 68%. Their own modifications of texture energy achieved 86%, or 93% when a particular
nonmaximal-suppression step was added. (The nonmaximal-suppression result
is  interesting  and  warrants  further  study,  as  does  a  rank  transformation  used  elsewhere 
[Harwood  83].  I  disagree,  however,  with  Pietikainen’s  conclusion  that  the  power  of  texture 
energy  measures  “depends  on  the  general  forms  of  the  masks  (edge-like,  spot-like,  etc.), 
rather than on the specific numerical values . . .” While this is true to some degree, Chapter
6 of my dissertation shows that pattern detectors of the same general type can vary
greatly  in  individual  classificatory  power,  and  that  few  approach  the  power — documented 
in  Figure  7-3 — of  the  texture  energy  measures.) 

For the record, the seventh- through eleventh-order basis-vector sets are shown in
Figure 2. I use the zero-sequency vectors for the Gaussian window weighting mentioned
previously. Two-dimensional integer convolution with such vectors can be accomplished in
32-bit registers (with minimal truncation error) if care is taken in scaling the intermediate
results. These vectors differ only slightly from the “bandpass binomial windows of even
order” or the difference-of-binomial vectors investigated by Nicholson [80], a particularly
simple basis set to implement as real-time digital filters. Rectangular masks formed from
any of these vector sets resemble sampled 2-D Gabor-filter masks [Daugman 84, 85], said
to resemble the spatial response of certain neurons in the human visual system. Many
other derivations of such filters have been reported [Danielsson 80, 85; Hashimoto 83;
Knutsson 83; Tanimoto 78], including an interesting hexagonal form [Nicholson 80].

Figure 2: High-Order Lattice Aperture Waveform Sets

^5 Sequency is the number of interior zero crossings and is related to frequency. Convolving two basis
vectors from any of these sets produces another vector with sequency equal to the sum of the original
sequencies.

Gabor filters, used in 1-D detection theory, are Gaussian-weighted sinusoids with sine
and cosine phase. Paired waveforms are combined to determine target position and frequency,
with minimum joint uncertainty in the two measurements. Each Gabor measurement
is essentially a local 2-D Fourier coefficient measured at one pixel position. Ratios of
these coefficients for different frequencies can be used to estimate local fractal dimension
[David Heeger, unpublished], since true fractals have a 1/f relationship between energy
and spatial frequency.

The theory of separable 2-D Gabor filters leads to an easy way of constructing rectangular
masks with any orientational sensitivity. Rectangular masks that approximate
oriented 2-D Gabor filters with an integral number of pixels per cycle along the horizontal
and vertical axes can be constructed by multiplying corresponding terms of two vectors
with appropriate frequencies:

$$ \begin{aligned}
S(x,y) &= S(x)\,C(y) + C(x)\,S(y) \\
C(x,y) &= C(x)\,C(y) - S(x)\,S(y),
\end{aligned} $$

where x is the horizontal position, y is the vertical position, S(x) and C(x) are vectors
with sine (antisymmetric, or odd) and cosine (symmetric, or even) phase, and S(x,y) and
C(x,y) are the two masks for the orientation determined by the arctangent of the relative
cycle lengths along the two axes [Heeger, unpublished]. If equal horizontal and vertical
cycle lengths are used, the following masks with 45-degree orientation are generated:^6
Reversing (or negating) one of the sine vectors rotates the direction of maximal response
by  90  degrees.  Although  these  diagonal  masks  are  linear  combinations  of  the  masks  in 
Figure  1,  and  corresponding  filtered  images  are  linear  combinations  of  the  original  filtered 
images,  their  local-variance  or  texture  energy  measures  need  not  be  derivable  from  the 
horizontal  and  vertical  texture  energies. 
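
The two combination rules, S(x)C(y) + C(x)S(y) and C(x)C(y) - S(x)S(y), can be checked with a short Python sketch. The choice of the length-3 vectors [-1, 0, 1] and [-1, 2, -1] as the sine-phase and cosine-phase inputs is an assumption made for illustration, as are the function names.

```python
def outer(vertical, horizontal):
    """Return mask[y][x] = vertical[y] * horizontal[x]."""
    return [[a * b for b in horizontal] for a in vertical]


def oriented_pair(sx, cx, sy, cy):
    """Return (S(x,y), C(x,y)) built from sine/cosine vectors along x and y."""
    sc, cs = outer(cy, sx), outer(sy, cx)   # S(x)C(y) and C(x)S(y)
    cc, ss = outer(cy, cx), outer(sy, sx)   # C(x)C(y) and S(x)S(y)
    n_rows, n_cols = len(sy), len(sx)
    sine = [[sc[y][x] + cs[y][x] for x in range(n_cols)] for y in range(n_rows)]
    cosine = [[cc[y][x] - ss[y][x] for x in range(n_cols)] for y in range(n_rows)]
    return sine, cosine


SINE3 = [-1, 0, 1]      # assumed antisymmetric (sine-phase) vector
COSINE3 = [-1, 2, -1]   # assumed symmetric (cosine-phase) vector

sine_mask, cosine_mask = oriented_pair(SINE3, COSINE3, SINE3, COSINE3)

# Negating the horizontal sine vector gives the pair for the other diagonal.
negated = [-t for t in SINE3]
flipped_sine, flipped_cosine = oriented_pair(negated, COSINE3, SINE3, COSINE3)
```

With these inputs every entry of both masks is even, and `flipped_cosine` is the left-right mirror image of `cosine_mask`, that is, the same pattern along the opposite diagonal.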

The Gabor-filter approach differs from texture-energy filtering in that a sum of squared
sine  and  cosine  responses  is  used.  (An  arctangent  of  the  ratio  may  also  be  computed 
to  obtain  a  phase  or  target-position  estimate.)  Although  provably  optimal  in  certain 
detection  problems,  this  is  not  suitable  as  the  filtering  step  in  texture  energy  computation. 
Gabor  filters  of  a  given  frequency  are  invariant  to  positional  (or  phase)  shift  in  that 
frequency,  making  them  insensitive  to  some  rather  drastic  deformations  in  spatial  target 
signatures. Gabor filters also tend to “lock onto” strong image structures as they are
scanned across the image, thus blurring the estimate of texture element location. The
single-mask outputs used in texture energy computation are keyed more to edges and
spots than to individual frequencies, and are more closely tied to the mask position.

^6 These can obviously be scaled down by a factor of 2.

Although  the  extra  directional  freedom  of  Gabor  masks  may  be  of  use  in  a  full  vision 
system  (as  it  exists  in  humans),  I  have  not  found  the  extra  texture  energy  measurements 
derived  from  them  to  be  obviously  useful  in  road  tracking.  I  have  also  not  found  fractal 
texture  measures  (computed  as  ratios  of  Gabor-filter  outputs)  to  be  useful,  primarily 
because  the  low-frequency  masks  are  much  larger  than  typical  texture  elements  in  this 
imagery. 

My experience is that one should generally use the lowest-sequency zero-sum texture-energy
vectors or, equivalently, the smallest filter size for a given cycle length. High-sequency
filters gain frequency resolution at the expense of spatial position; therefore,
they perform badly in cluttered images, although they may be useful when textures are
inherently repetitive (e.g., roof tiles or row crops in high-resolution aerial imagery). In
some cases, it is desirable to sum the log variance outputs for matched horizontal and
vertical  masks  (e.g.,  the  two  Sobel  gradient  masks)  so  as  to  decrease  rotational  sensitivity. 
One  might  also  compute  ratios  of  texture  energy  outputs  to  reduce  sensitivity  to  contrast 
and  illumination  changes — although  sensitivity  to  certain  other  texture  characteristics 
may  also  be  reduced.  Such  normalizations  are  still  a  black  art. 
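
The effect of the two normalizations on log-variance outputs can be written out briefly. The symbols below are introduced here for illustration only: $\sigma_h^2$ and $\sigma_v^2$ denote the local variances of the horizontally and vertically filtered images, and $\sigma_a^2$, $\sigma_b^2$ denote the variances from any two masks.

$$ \begin{aligned}
E_{\text{sum}} &= \log \sigma_h^2 + \log \sigma_v^2 = \log\left(\sigma_h^2 \sigma_v^2\right) \\
E_{\text{ratio}} &= \log \frac{\sigma_a^2}{\sigma_b^2} = \log \sigma_a^2 - \log \sigma_b^2
\end{aligned} $$

If image contrast is multiplied by a factor $k$, every linear filter output scales by $k$ and every variance by $k^2$, so each log variance shifts by $2 \log k$. The shift cancels in $E_{\text{ratio}}$, which is why a ratio is insensitive to contrast, while $E_{\text{sum}}$ shifts by $4 \log k$ and keeps its contrast dependence.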

With  the  possible  exception  of  fractal  measures,  none  of  these  operators  are  stable 
across  different  image  scales;  they  are  thus  difficult  to  use  in  low-angle  perspective  imagery. 
They  also  tend  to  be  too  coarse,  even  for  3  x  3  texture  energy  masks  in  5  x  5  variance 
windows,  to  measure  the  vegetation  textures  that  are  important  for  autonomous  vehicle 
navigation  and  aerial  cartography.  Finally,  all  such  texture  operators  respond  very  strongly 
to  object  borders,  producing  meaningless  results  for  small  or  thin  regions.  Texture  areas 
found  with  these  discriminant  features  must  therefore  be  “grown”  out  to  their  borders  with 
other  regions,  whereupon  small  patches  may  be  missed  (or  misclassified)  entirely.  These 
are  among  the  reasons  for  my  avoiding  elaborate  texture  measures  whenever  possible. 

It  seems  necessary  to  segment  an  image  before  really  meaningful  texture  measures  can 
be  computed.  Fortunately,  this  can  usually  be  done  by  using  pixel  brightness  and  color 
alone.  It  is  rare  (in  these  applications)  to  find  two  adjacent  regions  that  differ  only  in 
texture  and  not  in  local  brightness  or  color.  Segmentation  can  sometimes  be  improved  even 
more  by  including  local  log  variance  (of  the  unfiltered  image)  as  an  additional  descriptor 
band,  although  this  measure  does  respond  to  region  borders. 

Unfortunately, prior segmentation is not the whole answer even when it is feasible. No
one  has  yet  developed  a  good  way  to  compute  texture  energy  measures  over  nonrectangular 
regions.  (It  would  be  necessary  to  estimate  the  probable  contribution  of  missing  data  in 
the  filter  window  under  the  assumption  that  it  came  from  the  same  unknown  texture  as 
the available data.) Other texture measures, such as co-occurrence or joint probability
measures [Haralick 73, 79], are easier to compute over irregularly shaped regions, even
though they are less powerful and less plausible as models of human visual processing.
An intriguing idea, which I have yet to develop, is to use similarities and differences
of local histograms as texture measures. This is related to the classification technique
described below (also in [Laws 85, 88b]) for extracting and identifying scene materials
by using only single-band histograms.

4  Target  Acquisition


The  initial  step  in  tracking  an  object  or  material  type  is  to  find  it  in  the  first  image. 
In  some  applications  the  position  and  shape  of  the  prototype  are  known  or  can  be  traced 
by  the  user;  in  others  the  prototype  must  be  identified  by  means  of  spectral  signatures  or 
other  knowledge.  An  intermediate  method  is  to  let  the  computer  select  likely  prototypes 
and  have  the  user  indicate  which  are  correct.  I  shall  confine  my  discussion  to  autonomous 
target  acquisition,  although  the  KNIFE  system  does  permit  specification  by  tracing. 

One  approach  is  to  segment  the  initial  image,  then  classify,  group,  and  further  split 
regions until a target can be identified. My system partitions the image into homogeneous
regions, absorbing small fragments while keeping larger regions. Large, distinct
regions, such as sky and road, are found first, after which smaller details are extracted.
(Identification becomes easier as background regions are labeled and removed from
consideration. Spatial and spectral knowledge obtained from other scene objects could also
aid  recognition.) 

This  scene  parsing  approach  is  a  major  goal  of  current  research  on  segmentation  and 
scene analysis. In limited domains, however, it may be very easy to find a particular target
class.  In  road  following,  for  instance,  we  may  be  able  to  assume  that  the  lower  half  of  the 
initial  image  is  mostly  road  surface  (possibly  shadowed  by  the  vehicle  or  other  objects). 
All  regions  that  are  given  the  road  or  shadowed-road  label  can  be  merged,  and  any  holes 
can  be  absorbed  or  passed  to  a  reasoning  system  for  analysis.  Signatures  extracted  from 
these  initial  regions  can  be  employed  in  a  classification  step  to  recognize  additional  parts 
of  the  road. 

Another  approach  is  to  label  pixels  with  stored  signatures  (or  histograms),  then  extract 
connected regions that have been assigned the target label [Laws 88b]. While this is faster,
it is often less reliable than a full parse. The approach works if the stored signatures are
representative, which may require that they come from the same imagery or that they
be adjusted for variations in illumination or target range. This is essentially the tracking
approach;  it  will  be  described  in  more  detail  in  the  next  section. 
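
The label-then-extract idea can be sketched as follows. This is a simplified single-band version under stated assumptions: signatures are 8-bin normalized histograms, each pixel receives the class whose histogram is most populated at the pixel's bin, and regions are 4-connected components. The names `signature`, `label_pixels`, and `connected_regions` are hypothetical and do not correspond to KNIFE commands.

```python
from collections import deque

def signature(pixels, bins=8, levels=256):
    """Normalized gray-level histogram used as a class signature."""
    hist = [0.0] * bins
    for p in pixels:
        hist[p * bins // levels] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def label_pixels(img, signatures, bins=8, levels=256):
    """Give each pixel the class whose signature is most populated at its bin.

    Ties (e.g., a bin that is empty in every signature) go to the first class
    in the dictionary; a real system would need a better fallback.
    """
    def best(p):
        b = p * bins // levels
        return max(signatures, key=lambda name: signatures[name][b])
    return [[best(p) for p in row] for row in img]

def connected_regions(labels, target):
    """4-connected components of the pixels that carry the target label."""
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != target or seen[r][c]:
                continue
            seen[r][c] = True
            queue, region = deque([(r, c)]), []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx] and labels[ny][nx] == target):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions

# Stored signatures (illustrative values only).
stored = {"road": signature([100, 110, 120, 105]),
          "other": signature([20, 30, 200, 220])}
image = [[ 25, 210, 215,  30],
         [ 28, 108, 112,  22],
         [104, 115, 118, 109],
         [101, 111, 117,  27]]
labels = label_pixels(image, stored)
road_regions = connected_regions(labels, "road")
```

In this toy image the nine mid-gray pixels form a single connected "road" region. If the stored signatures were not representative of the image, the same code would label the road as "other", which is the reliability problem noted above.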

A  third  approach  is  to  use  likelihood  maps  to  label  initial  regions.  It  may  be  that 
we  know  approximate  or  hypothesized  object  positions  in  the  image  and  that  we  can 
estimate  an  a  priori  probability  of  any  particular  pixel  position’s  belonging  to  the  target. 
We  can  then  segment  the  image  and  look  for  overlap  between  the  image  segments  and  the 
hypothesized  label  positions.  This  seems  to  work  well  even  when  the  “likelihood  map” 
is  just  a  binary  mask  derived  from  a  synthetic  region  map — i.e.,  when  we  are  “tracking” 
from a hypothesized scene instead of from a previous image. This type of initial labeling is
trivial  if  the  synthetic  objects  overlap  their  true  image  regions  by  at  least  50%;  otherwise 
much  more  elaborate  constraint-satisfaction  techniques  must  be  employed. 
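
A minimal sketch of the overlap test, assuming the image has already been segmented into an integer label map and that the likelihood map is a binary mask on the same pixel grid. The overlap is measured here as the fraction of each segment that falls inside the mask; that choice, the 0.5 default, and the function name are assumptions made for the example.

```python
def label_by_overlap(segments, mask, min_fraction=0.5):
    """Return ids of segments lying at least min_fraction inside the mask."""
    inside, total = {}, {}
    for seg_row, mask_row in zip(segments, mask):
        for seg, hit in zip(seg_row, mask_row):
            total[seg] = total.get(seg, 0) + 1
            inside[seg] = inside.get(seg, 0) + (1 if hit else 0)
    return sorted(s for s in total if inside[s] / total[s] >= min_fraction)

# Three segments and a hypothesized target position (1 = target expected).
segments = [[1, 1, 2, 2],
            [1, 1, 2, 2],
            [3, 3, 3, 3]]
mask = [[0, 0, 1, 1],
        [0, 1, 1, 1],
        [0, 0, 0, 1]]
target_segments = label_by_overlap(segments, mask)
```

Segment 2 lies wholly inside the mask and is accepted; segments 1 and 3 overlap it by only a quarter of their area and are rejected.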

The  KNIFE  analysis  system  can  be  used  for  any  of  these  approaches,  since  they 
all  require  similar  techniques  for  region  representation,  threshold  analysis,  connected- 
component extraction, and noise cleaning. The acquisition method illustrated in this
paper  is  to  start  with  a  reliable  seed  region  and  grow  it  to  obtain  a  larger  prototype 
region  with  an  image-specific  signature.  The  following  section  documents  my  method  of 
tracking  regions  or  material  types  once  a  partial  scene  parse  is  available.  I  assume  that 
signatures  are  known  or  can  be  extracted  for  all  major  scene  entities.  Just  “target”  and 
“background”  signatures  may  be  sufficient,  but  better  partitioning  is  often  possible  when 
all visual classes are independently represented in the training set.


5  Classification  Tracking 

The  tracking  problem  is  to  start  with  one  instance  of  an  object  (or  with  an  object 
signature  stored  in  a  database)  and  then  find  a  matching  instance  in  a  new  image.  It 
may  also  be  desirable  to  find  additional  instances  of  an  object  or  material  class  in  the 
new  image,  should  they  be  present.  We  may  even  want  to  use  characteristics  of  the  newly 
identified  instances  to  better  extract  the  original  objects — a  joint-estimation  problem. 

The tracking task is complicated by the fact that regions in a second image may correspond
poorly to those in the first. The original objects may have moved, of course, or
their images may have been broken into smaller regions or left combined with neighboring
objects. This is generally an insoluble problem, so I shall assume that the scene is sufficiently
stable and the partitioning and labeling system sufficiently good that we can easily
determine  region  correspondences.  Relaxation  labeling  algorithms  have  been  developed 
for establishing correspondences in more difficult cases [Price 76, 81, 82; Faugeras 81,
82; Kitchen 80, 84]; these are related to search algorithms for labeling region clusters
as known objects [Duda 70; Freuder 72, 73, 76, 77; Barrow 76; Tennenbaum 76, 77;
Khan 84ab; Wesley 83, 84, 86].

The road-tracking task is simplified because imaging conditions almost always guarantee
significant region overlap among images. In fact, most of the lower portion of the
image  can  be  assumed  to  be  road  if  the  vehicle  is  operating  normally.  Tracking  trees 
and  other  roadside  objects  (to  permit  map  generation,  for  example)  is  considerably  more 
difficult.  Some  of  this  work  must  be  left  to  higher-level  reasoning  processes,  but  the  road 
tracking  itself  can  almost  always  be  done  with  low-level  classification  and  partitioning 
techniques. 

The  principal  output  from  a  road  tracker  is  the  boundary  of  the  navigable  road  in 
each  image.  This  can  be  augmented  by  identifying  the  center  stripe  or  divider,  wheel 
ruts,  potholes,  shadows,  material  changes,  and  other  minor  markings.  The  boundary 
of  distant  road  sections  may  be  somewhat  uncertain  as  long  as  the  vehicle  has  time  to 
gather  and  analyze  additional  imagery  before  making  navigational  decisions.  The  tracker 
or  associated  reasoning  system  should  also  detect  intersections,  forks,  dead  ends,  and  such 
major  obstacles  as  vehicles,  gates,  shell  holes,  and  missing  bridges.  It  must  also  distinguish 
these  obstructions  from  curves,  dips,  overhanging  branches,  telephone  poles,  traffic  signs, 
and  other  objects  that  do  not  impede  travel. 

Image  regions,  even  when  perfectly  extracted  and  labeled,  do  not  correspond  directly 
to  such  outputs.  A  telephone  pole,  for  instance,  may  block  our  view  of  a  curving  road 
in  such  a  way  that  the  road  is  split  into  two  pieces.  The  best  that  a  low-level,  task- 
independent  vision  system  can  do  is  to  indicate  the  visible  portions  of  the  road,  hypothesize 
a  connection  between  them,  and  note  the  presence  and  characteristics  of  the  occluding 
object.  Establishing  correspondence  between  the  current  regions  and  those  in  past  images 
may  also  help  to  classify  them. 

A  region  graph  shows  how  regions  have  split  or  joined  to  form  other  regions  during 
image  partitioning.  A  similar  region  graph,  which  I  call  a  propagation  graph,  can  be  built 
to  show  how  regions  in  one  image  have  split  or  joined  to  form  regions  in  the  other. 
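
One way to picture a propagation graph is as a table of pixel overlaps between the region maps of two images. The sketch below assumes that both images share the same pixel grid and that overlap alone defines the links; the dictionary representation and the `min_pixels` threshold are assumptions for illustration, not a description of KNIFE's data structure.

```python
def propagation_graph(prev_labels, next_labels, min_pixels=1):
    """Map each region id in the previous image to the region ids it overlaps
    in the next image, together with the overlap pixel counts."""
    counts = {}
    for prev_row, next_row in zip(prev_labels, next_labels):
        for p, n in zip(prev_row, next_row):
            counts[(p, n)] = counts.get((p, n), 0) + 1
    graph = {}
    for (p, n), k in counts.items():
        if k >= min_pixels:
            graph.setdefault(p, {})[n] = k
    return graph

prev_map = [[1, 1, 2, 2],
            [1, 1, 2, 2]]
next_map = [[5, 5, 5, 6],
            [7, 7, 5, 6]]
graph = propagation_graph(prev_map, next_map)
```

Here region 1 splits into regions 5 and 7, while region 5 also receives pixels from region 2, so it records a join. Raising `min_pixels` would suppress links caused by a few stray boundary pixels.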

Pixel  classification  is  at  the  heart  of  material  identification.  Either  pixels  must  be 
classified and then grouped into regions or regions must be extracted and then classified.
Since  materials  can  exhibit  an  infinite  variety  of  shapes,  region  classification  is  again  done 
mainly  by  classifying  the  pixels  and  selecting  a  consensus  label.  When  objects  are  tracked 
from  image  to  image,  such  parameters  as  size,  shape,  position,  and  context  can  be  used 
to  facilitate  region  labeling. 
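
Selecting a consensus label can be sketched as a simple majority vote over the pixel labels inside a region. The function name and the returned vote share are assumptions for the example; a practical system might also weigh size, shape, position, and context, as noted above.

```python
from collections import Counter

def consensus_label(pixel_labels, region):
    """Most common pixel label inside a region, with its share of the votes.

    region is a list of (row, col) coordinates.
    """
    votes = Counter(pixel_labels[r][c] for r, c in region)
    label, count = votes.most_common(1)[0]
    return label, count / len(region)

pixel_labels = [["road", "road", "other"],
                ["road", "other", "other"]]
region = [(0, 0), (0, 1), (1, 0), (1, 1)]
result = consensus_label(pixel_labels, region)
```

A low vote share could be used as a signal that the region is a mixture and should be split before labeling.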

Many  target  classes  are  composed  of  more  than  one  material  or  surface  appearance. 
The  sides  and  roof  of  a  house,  for  instance,  may  appear  very  different — and  even  the  sides 
may  differ  because  of  detail  and  illumination.  It  is  often  convenient  to  treat  such  an  object 
as  a  unit,  rather  than  model  the  individual  parts.  This  can  reduce  classificatory  power 
if  only  point  properties  (e.g.,  brightness  and  color)  are  used,  but  the  inclusion  of  texture 
measures  may  prevent  a  highly  textured  region  from  being  mistaken  for  a  configuration 
of  homogeneous  regions.  Because  KNIFE  uses  histograms  as  signatures,  such  composite 
objects  can  easily  be  represented.  (Representation  of  mixed  material  classes  by  Gaussian 
parameters — mean,  standard  deviation,  and  perhaps  higher  moments — would  be  almost 
meaningless.) 
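
A small numerical illustration of this point, using made-up gray levels for a composite object with dark walls and a bright roof. The histogram keeps both modes, whereas the mean falls at a gray level that no pixel of the object actually has.

```python
import statistics

def histogram(pixels, bins=8, levels=256):
    """Gray-level histogram with equal-width bins."""
    hist = [0] * bins
    for p in pixels:
        hist[p * bins // levels] += 1
    return hist

# Illustrative values only: dark walls plus a bright roof.
house = [48, 50, 52, 55, 198, 200, 202, 205]

mean = statistics.mean(house)       # 126.25, between the two modes
stdev = statistics.pstdev(house)    # large, but says nothing about bimodality
hist = histogram(house)             # two populated bins, one per material
mean_bin = int(mean) * 8 // 256     # the bin that contains the mean
```

The bin containing the mean is empty, so a classifier built on the mean and standard deviation would rate mid-gray pixels as the most typical of the house even though none occur in it.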

6  Example 

The  KNIFE  program  has  only  recently  been  adapted  to  road  tracking,  and  I  have 
yet  to  develop  an  adequate  control  algorithm.  The  following  example  hints  at  the  power 
of KNIFE’s classification-based tracking, but it is not yet a fully developed application.
Sophisticated  monitoring  processes  will  be  needed  to  detect  overmerging  (or  undermerging, 
loss  of  track,  etc.)  and  to  reacquire  a  target. 

Figure  3(a)  shows  frame  number  500  in  a  sequence  of  images  taken  from  FMC’s 
autonomous-vehicle  prototype.  Only  the  VHS  image’s  vividness  band  is  shown,  although 
all three bands are used for tracking. The lower portion of the image has been selected
(interactively  and  arbitrarily)  as  a  training  region  for  the  road  material  type.  Figure  3(b) 
shows  the  result  of  KNIFE’s  multiband  region  growing  with  seglevel  set  to  1  [Laws  88a]. 
The results are not perfect (because KNIFE uses a linear model, rather than a curved or
polynomial one, of multidimensional surfaces), but are adequate and perhaps even typical
of road regions extracted without semantic guidance.

I need to track this region into the next available image in the series, Frame 519.
First  I  update  my  knowledge  of  road  signatures  (i.e.,  multiband  histograms),  but  without 
forgetting  the  initial  training  signature.  KNIFE  lets  me  do  this  by  storing  both  the  original 
and  new  road  signatures  as  two  exemplars  of  road  appearance.  I  also  store  the  nonroad, 
or  “other”  signature  so  that  I  can  assign  pixels  in  the  next  image  to  the  more  similar  of 
the  two  semantic  categories. 
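
The effect of keeping several exemplars per category can be sketched as follows. This single-band toy version assumes that a pixel goes to the category whose best-matching exemplar histogram is most populated at the pixel's bin; the scoring rule, the names, and the gray levels are assumptions for illustration rather than KNIFE's actual similarity measure.

```python
def signature(pixels, bins=8, levels=256):
    """Normalized gray-level histogram used as one exemplar of a category."""
    hist = [0.0] * bins
    for p in pixels:
        hist[p * bins // levels] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def classify(pixel, exemplars, bins=8, levels=256):
    """Assign the pixel to the category with the best-matching exemplar."""
    b = pixel * bins // levels
    return max(exemplars,
               key=lambda cat: max(sig[b] for sig in exemplars[cat]))

# Signatures from the first frame (illustrative values only).
exemplars = {"road": [signature([100, 105, 110, 120])],
             "other": [signature([20, 30, 140, 220])]}
before = classify(145, exemplars)   # brighter road pixel, not yet known

# Add the newly extracted road signature without discarding the original.
exemplars["road"].append(signature([130, 135, 140, 150]))
after = classify(145, exemplars)
old_road = classify(108, exemplars)  # original road appearance still matches
```

The brighter pixel is labeled "other" before the update and "road" afterward, while pixels matching the original training signature are still recognized.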

Figure  4(a)  shows  the  result  of  this  classification-based  tracking.  The  central  portion 
of the road is labeled “other” and the mountain face is labeled “road,” but, on the whole,
the  tracking  has  been  successful.  The  extracted  regions  are  suitable  for  processing  by  a 
higher-level  reasoning  system. 

Since  KNIFE  is  domain-independent,  I  have  provided  the  higher-level  control  by  issuing 
a  region-growing  command  for  the  largest  road  region.  This  yielded  the  clean  extraction 
in  Figure  4(b),  although  there  are  certainly  scenes  for  which  simple  region  growing  would 
be  inadequate. 
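
The control step can be sketched as "pick the largest road region, then grow it." The growing rule below is deliberately simple and is an assumption: a neighbor joins the region if its gray level is within a fixed tolerance of the seed region's mean. KNIFE's multiband region growing fits a surface model instead, so this shows only the overall flow.

```python
from collections import deque

def grow_region(img, seed_region, tol=15):
    """Add 4-connected neighbors whose value is within tol of the seed mean."""
    rows, cols = len(img), len(img[0])
    mean = sum(img[r][c] for r, c in seed_region) / len(seed_region)
    region = set(seed_region)
    queue = deque(seed_region)
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
            if (0 <= ny < rows and 0 <= nx < cols
                    and (ny, nx) not in region
                    and abs(img[ny][nx] - mean) <= tol):
                region.add((ny, nx))
                queue.append((ny, nx))
    return region

image = [[ 20, 200, 210,  30],
         [100, 104, 108, 102],
         [101, 112, 111, 103],
         [ 99, 105, 107, 100]]
# Regions labeled "road" by the classifier (illustrative coordinates).
road_regions = [[(0, 0)], [(2, 0), (3, 0), (3, 1)]]
largest = max(road_regions, key=len)
grown = grow_region(image, largest)
```

Growing from the three-pixel seed recovers all twelve mid-gray pixels and leaves the top row untouched. With a larger tolerance, or a neighboring region of similar brightness, the same rule would overmerge.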

Figure 4: Tracking Results for the ALV519 Image. (a) Two-Category Classification; (b) Grown Road Region.

Next, without worrying about the misclassified mountain face, I propagated the new
“road” and “other” signatures (along with the ones gathered previously) into Frame 525.
The  result  is  shown  in  Figure  5(a).  The  central  portion  of  the  road  has  been  misclassified 
here  too,  but  the  bulk  is  correct.  I  ordered  region  growing  on  the  largest  road  region, 
obtaining  the  results  depicted  in  Figure  5(b).  It  is  not  obvious  whether  the  program  is 
correct  or  incorrect  in  assigning  the  two  distant,  dark  patches  to  the  road  class. 

Figure  6  illustrates  the  same  process  applied  to  Frame  533  in  the  sequence.  This  time 
the  jump  is  too  large  or  the  two-class  model  too  impoverished.  KNIFE  does  a  good  job 
of  finding  unshadowed  road,  but  lacks  knowledge  of  shadowed  road  and  so  assigns  it  the 
“other”  label.  Attempts  at  growing  either  of  the  large  road  regions  result  in  overmerging 
and  loss  of  track,  as  shown  in  Figure  6(b). 

I  confess  that  I  had  less  success  with  this  simple  classify/grow  cycle  on  a  different 
sequence  of  images.  I  had  to  use  a  four-class  model — road,  sky,  trees,  and  grass — and 
even then lost track when the road touched a very similar portion of sky. Because
tracking in natural imagery is difficult, knowledge-based control techniques are essential.
On  the  whole,  though,  KNIFE’s  histogram-based  classification  tracking  works  well  and  is 
fast  enough  to  be  considered  for  real-time  implementation. 

7  Summary 

The  KNIFE  system  is  a  testbed  for  partitioning  and  labeling  techniques.  Just  as  a 
mail  program  permits  editing  and  manipulation  of  messages,  KNIFE  permits  editing  and 
manipulation  of  image  regions.  It  can  be  viewed  as  a  stand-alone  program,  as  part  of  a 
larger  system,  or  as  an  environment  in  which  to  build  other  capabilities. 

This  paper  has  described  the  use  of  pixel  and  region  classification  to  track  objects 
or  material  types  from  one  image  to  another.  Minimal  use  was  made  of  task-specific  or 
target-specific  knowledge  so  as  to  facilitate  modular  construction  of  more  complex  systems. 
Desired  targets  are  described  declaratively  by  their  multiband  signatures  (or  histograms), 
rather  than  procedurally,  thereby  simplifying  database  and  interface  requirements.  These 
signatures  can  be  extracted  automatically  from  known  instances  in  previous  imagery. 

A large portion of this paper was devoted to the subject of color and texture measurement.
Although no ultimate solution is available for characterizing such pixel properties,
I have specified particular techniques that work well and are easy to compute. These
provide  a  baseline  for  further  advances  in  texture  classification  and  segmentation. 

Once  signatures  are  available,  histogram-based  techniques  can  be  used  to  label  pixels 
or  regions  in  new  images  (or  new  parts  of  an  initial  image).  Multiband  histograms  for  these 
regions can be used to update the stored material-class signatures so that specific objects
can  be  tracked  into  future  images.  This  approach  is  fairly  invulnerable  to  occlusions  and 
slow  changes  in  object  appearance. 

Figure 6: Tracking Results for the ALV533 Image. (a) Two-Category Classification; (b) Grown Road Region.